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INTRODUCTION 


A  large  number  of  pattern  recognition  tasks  require  the  ability  to  recognize  patterns  within  data  when 
the  character  of  the  patterns  may  change  with  time.  Examples  of  such  tasks  are  remote  sensing, 
autonomous  control,  and  automatic  target  recognition  in  a  changing  environment.  (Titterington  et  al. 
[1985],  Chapter  2,  gives  a  list  of  tasks  to  which  mixture  models  have  been  applied.  Many  of  these  tasks, 
and  their  variants,  fall  into  the  above  categories.)  These  tasks  have  a  common  requirement:  the  need  to 
recognize  new  entities  as  they  enter  the  environment.  A  pattern  recognition  system  must  be  able  to 
recognize  and  develop  a  representation  of  a  new  pattern  in  the  environment  as  well  as  to  change  its 
representation  of  the  statistics  of  the  pattern  dynamically. 

The  adaptive  mixtures  approach  presented  here  uses  density  estimation  to  develop  decision  funciions 
for  supervised  and  unsupervised  learning.  Much  work  in  performing  density  estimation  in  supervised  and 
unsupervised  situations  has  been  done.  For  the  most  part,  this  research  has  centered  on  approaches  that 
use  a  great  deal  of  a  priori  information  about  the  structure  of  the  data.  In  particular,  parametric 
assumptions  are  often  made  concerning  the  underlying  model  of  the  data.  While  these  approaches  yield 
impressive  results,  nonparametric  approaches  (Tapia  and  Thompson  [1978],  Prakasa  Rao  [1980])  free  of 
a  priori  assumptions  can  be  considered  more  powerful  due  to  their  increased  generality  and  therefore 
wider  applicability.  Developing  a  system  for  performing  unsupervised  learning  nonparametrically  (that  is, 
devoid  of  restricting  assumptions)  is  a  daunting  task.  In  fact,  there  are  many  instances  in  which  no  system 
can  be  assured  of  performing  properly.  For  example,  two  classes  with  identical  distributions  cannot  be 
identified  as  such  based  on  purely  unsupervised  learning.  Nevertheless,  a  nonparametric  density 
estimation  approach  to  unsupervised  learning  can,  in  many  cases,  lead  to  a  general  and  powerful  pattern 
recognition  tool. 

In  addition  to  the  nonparametric  assumption,  we  also  consider  the  problem  of  recursive  estimation 
(Titterington  et  al.  [1985],  Chapter  6;  Nevel’son  and  Has’minskii  [1973];  Young  and  Calvert  [1974]). 
That  is,  it  is  assumed  that,  due  to  high  data  rates  or  time  constraints,  we  must  develop  our  estimates  in 
such  a  way  that  they  do  not  require  the  storage  or  processing  of  all  observations  to  date.  This  also  limits 
our  ability  to  develop  optimal  estimates,  but  often  is  the  only  approach  for  a  given  application. 

By  virtue  of  addressing  the  types  of  applications  that  can  be  termed  recursive  and  nonparametric,  we 
have  at  once  made  the  problem  more  difficult  and  more  interesting.  The  recursive  assumption  eliminates 
the  possibility  of  using  iterative  techniques.  It  is  necessary,  by  hypothesis,  to  develop  our  estimate  at  time  t 
only  from  our  previous  estimate  and  the  newest  observation.  The  nonparametric  assumption  implies  that 
we  cannot  make  any  but  the  simplest  assumptions  about  our  data.  Realistic  restrictions  on  processing  and 
memory,  as  might  be  imposed  in  automatic  target  recognition,  remote  sensing,  and  automatic  control 
applications,  in  conjunction  with  high  data  rates,  make  such  applications,  and  the  procedure  discussed 
here,  an  important  subject  in  pattern  recognition. 

In  this  work,  we  apply  statistical  pattern  recognition  concepts  to  the  problem  of  recursive 
nonparametric  pattern  recognition  in  dynamic  environments.  We  begin  with  a  formal  description  of 
pattern  recognition  in  this  context.  Next,  we  present  a  formulation  of  the  learning  problem  (supervised 
and  unsupervised).  While  this  concept  has  been  studied  by  many  authors  (Yakowitz  [1969,  1970],  O'Neill 
[1978],  Greblicki  [1980],  Postaire  [1982],  Niemann  and  Sagerer  [1982],  Gowda  and  Krishna  [1979]), 
little  work  has  been  done  in  the  recursive  nonparametric  unsupervised  problem.  Adaptive  mixtures,  a 
method  for  performing  both  supervised  and  unsupervised  learning,  is  then  developed.  Similarities  exist 
between  adaptive  mixtures  and  potential  functions  (Aizerman  et  al.  [1966]),  maximum  penalized 
likelihood  (Good  and  Gaskins  [1971],  Tapia  and  Thompson  [1978]),  and  reduced  kernel  estimators 
(Fukunaga  and  Hayes  [1989].)  We  conclude  with  a  discussion  of  adaptive  mixtures  and  their  relationship 
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to  learning.  The  two-category  problem  is  considered  throughout.  The  results  can  be  extended  to 
multicategory  problems  easily  (e.g.,  successive  dichotomies).  In  addition,  univariate  assumptions  are  made 
in  places  for  clarity.  The  following  discussion  will,  we  hope,  allow  for  meaningful  discussion  of  recursive 
nonparametric  learning  as  well  as  provide  a  useful  problem  definition  and  approach  from  which  to  begin 
addressing  specific  applications. 


MOTIVATION 


Learning  techniques  are  useful  in  a  broad  class  of  pattern  recognition  problems.  In  this  section,  we 
discuss  their  application  to  problems  requiring  recursive  and  nonparametric  processing.  We  begin  with  a 
formulation  of  the  general  pattern  recognition  problem.  Next,  we  present  density  estimation  as  an 
approach  to  pattern  recognition.  We  will  then  consider  the  problem  of  pattern  recognition  in  a  dynamic 
environment  and  the  need  for  both  supervised  and  unsupervised  learning  techniques  in  its  solution. 

We  will  adopt  the  following  conventions  on  notation.  A  parenthesized  superscript,  e.g.,  C(0,  will 
denote  an  element  of  an  array  or  set.  Time  is  considered  to  be  discrete,  t  =  0,1,2,...  and  a  subscript  t, 
e.g.,  Ft ,  will  denote  indexing  by  time.  Thus,  will  indicate  that  the  vector  D  may  be  a  function  of 
time,  while  £>(,)  will  be  used  when  D  is  independent  of  time. 

PATTERN  RECOGNITION 

Let  Q  =  {C(1),  C<2) . .  CiN)}  be  a  set  of  classes,  or  patterns.  We  wish  to  model  the  problem  of 

determining  which  pattern  is  present  at  a  given  time  t  on  the  values  of  a  number  of  attributes  or  features 
associated  with  these  patterns.  The  particular  time  sequence  of  patterns  presented  is  unknown.  For  this 
reason,  it  is  standard  to  construct  a  probabilistic  model  of  pattern  recognition.  The  patterns  will  be 
considered  as  the  elementary  events  in  a  probability  space,  constructed  as  follows:  Define  on  B,  the 
o-  algebra  generated  by  Q,  a  probability  measure  n(  indexed  by  time,  such  that  for  t  fixed, 
n,({C(0})  =  where 


N 


I*!"  =  * 


Thus,  (Q,  B,  IIt)  defines  a  probability  space  for  each  t,  and  n\'\  cn.ied  the  class  probability,  will 
correspond  to  the  probability  that  at  time  t  a  given  pattern  C(,)  will  be  observed. 

We  assume  there  are  m  real-valued  features  measured  at  each  occurrence  of  a  pattern,  yielding  the 
random  observation  sequence  xt  :  £2  -  ->  9lm  (Euclidean  m- space).  In  many  problems  of  interest,  the 
measurements  of  the  feature  values  of  a  particular  pattern  may  vary  randomly  from  one  presentation  of 

the  pattern  to  the  next.  When  this  occurs,  write  xt(C(l))  =  zj0,  with  Z<  =  (Z^ . z\N)),  a 

(vector-valued)  random  variable,  and  the  distribution  function  of  denoted  by  F((,)  (we  write  Z(t°  ~~ 
f\1)  )■  If  m  is  the  number  of  features  measured  at  each  observation,  then  each  Zt(,)  is  an  m-vector.  The 
distribution  of  feature  vectors  corresponding  to  a  single  class  models  (1)  noise  in  the  measurement  of 
features  and  (2)  possible  systematic  but  unknown  variations  among  exemplars  of  the  same  class.  The 
randomness  in  the  choice  of  class  and  the  randomness  of  the  feature  measurements  give 

i  -  l 
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If  Fx  =  Ft  for  all  r,  r'(  we  say  the  stochastic  process  X  is  stationary.  If,  on  the  other  hand,  there  exists 
r,  t',  such  that  F,  *  F>,  then  X  is  a  nonstationary  stochastic  process. 

The  pattern  recognition  system  S  is  required  to  map  an  observation  Xt  to  a  pattern  C(,).  That  is,  S 
requires  a  decision  function  d  :  -  ->  Q  .  Furthermore,  at  any  time  t  =  r,  S  may  be  required  to  make  a 

decision  based  on  its  processing  of  observations  {X,}*=1  up  to  time  r.  Generally  there  is  a  cost  associated 
with  misclassifications,  and  so  d  must  be  chosen  to  minimize  the  expected  value  of  some  given  cost,  or 
loss,  function.  These  conditions  are  quite  reasonable  for  real-time  applications  in  autonomous  control, 
remote  sensing,  or  automatic  target  recognition. 

The  problem  at  hand  then  is  the  construction  of  the  decision  function  d.  This  generally  involves 
constructing  estimates  of  the  statistics  of  the  problem.  Following  Kendall  [1966],  we  define  two  distinct 
approaches  to  the  tasks  to  be  performed:  classification  and  discrimination.  Discrimination  can  be 
described  as  a  supervised  task.  Based  on  a  set  of  observations  for  which  the  true  class  of  origin  is  known 
(the  teaching  set),  we  wish  to  construct  a  method  for  assigning  a  new  observation  of  unknown  origin  to  the 
correct  class,  yielding  a  decision  function  d  :  — >  Q  .  Classification,  on  the  other  hand,  is  an 

inherently  unsupervised  task.  Based  on  a  set  of  observations  of  unknown  class,  one  decides  whether 
groups  exist  within  this  data  set.  If  so,  one  attempts  to  construct  a  method  of  assigning  new  observations  to 
the  correct  class,  again  constructing  a  decision  function. 

In  the  supervised  task,  the  pattern  recognition  system  is  intended  to  infer  the  statistics  of  the  problem 
from  the  training  set,  for  which  it  is  told  from  which  class  the  observations  are  drawn.  That  is  to  say,  we 
observe  Xt,  with  the  added  information  that  the  observation  is  from  a  given  class  C(,).  Thus,  the  supervised 
problem  naturally  breaks  down  to  separate  estimates,  first,  of  the  distribution  of  the  individual  classes  and, 
second,  of  the  class  probabilities  n,  .  In  the  unsupervised  learning  problem,  it  is  certainly  possible  to 
estimate  the  distribution  Ft  of  xt,  but  it  is  much  harder  to  determine  the  decomposition  of  the  underlying 
mixture 


i-p-p 

i  =  1 

As  one  might  guess,  conditions  must  be  placed  on  this  underlying  mixture  in  order  for  this  problem  to  be 
solvable,  or  even  well  posr  !  in  the  sense  of  the  identifiability  of  the  mixture  (see  Titterington  et  al. 
[1985].) 

Much  of  what  follows  pertains  to  classification  and  discrimination.  While  the  nature  of  tasks  to  be 
performed  becomes  more  complicated  as  we  build  to  the  dynamic  environment  scenario,  the  requirements 
of  our  pattern  recognition  system  S  can  normally  be  thought  of  as  analogous  to  these  two  tasks.  In  general, 
we  have  available  a  teaching  data  set  {•*»}“=! ,  for  which  the  true  class  is  known,  and  untagged  observations 
{•K<}“=a+i.  for  which  the  true  class  is  unknown.  We  wish  to  perform  discrimination  based  on  {xj“=i  and  use 
the  decision  function  d  derived  during  this  process  to  assign  a  class  to  the  observations  (xJrU+i  •  We  would 
like,  if  possible,  to  use  {xjf=a+i  to  update  (and  improve)  d(*).  Using  these  data,  for  which  the  true  class 
is  unknown,  entails  unsupervised  learning. 


DENSITY  ESTIMATION  FOR  STATISTICAL  PATTERN  RECOGNITION 

While  there  are  numerous  approaches  to  constructing  decision  functions  (Fukunaga  [1972],  Chen 
[1973]),  we  concentrate  on  density  estimation  techniques.  This  places  the  burden  on  the  development  of 
an  overall  estimate  D  for  all  observations,  and  estimates  D(l)  for  each  C(i)  e  Q.  Developing  the  D(l)  will 
allow  the  straightforward  construction  of  a  decision  function  d(»),  namely 
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(1) 


d{ x)  s  c<a)3D(«>(x)  =  max 

/ 

or 


d(x)  =  C(a)BnwDta)(x)  =  max  nU)D0)( x) 

i 


(2) 


if  an  estimate  for  the  prior  probabilities  n  can  be  made  (see  Lawoko  and  McLachlan  [1989]  for  a 
discussion  of  bias  in  the  estimation  of  these  priors).  Similar  equations  hold  for  the  time-varying 
distributions.  In  the  unlikely  case  where  no  unique  maximum  exists,  a  tie  is  said  to  occur  and  any 
arbitrary  method  may  be  used  to  break  it  (e.g.,  a  coin  flip).  Note  that  we  are  not  calculating  the  decision 
surface  direcdy.  Rather,  we  are  evaluating  each  estimate  D(i)x. 

When  we  use  density  estimation  techniques  for  developing  a  decision  function,  we  obtain  probability 
information  as  well  as  simple  classification  information.  Furthermore,  if  we  define  the  probability  of 
misclassification  when  using  a  given  decision  function  d(»)  as  Pd(e)  ,  then  the  Bayesian  discriminant 
dB  minimizes  Pd{e),  where  ds  is  defined  by 


dfi(x)  =  max 
i 


*(«)£><«)(*) 

D(x) 


(3) 


Finally,  convergent  estimates,  d|°(x)->  D(,)(x)as  n  ->  °o ,  can  yield  Pd(e)  ->  Pfi(e)  =  Popt(e)  (see 
e.g.,  Wolverton  and  Wagner  [1969],  Greblicki  and  Pawlak  [1983]).  This  adds  justification  to  the  use  of 
density  estimates  in  constructing  decision  functions. 

NONPARAMETRIC  DENSITY  ESTIMATION 

There  are  two  basic  approaches  to  density  estimation:  parametric  techniques  and  nonparametric 
techniques.  The  idea  behind  parametric  estimation  is  to  assume  that  the  data  come  from  a  family  of 
densities  parameterized  by  a  set  of  parameters  (e.g.,  mean  and  variance),  and  to  estimate  the  parameters 
from  the  data.  This  assumes  some  knowledge  of  the  data  which  may  or  may  not  be  available.  In 
nonparametric  estimation,  the  density  is  not  assumed  to  be  from  a  parameterized  family,  but  is  rather 
assumed  to  satisfy  some  other  criteria  (such  as  smoothness  of  the  density).  A  familiar  nonparametric 
technique  is  the  histogram.  In  the  histogram,  the  choice  of  bin  width  determines  the  “smoothness”  of  the 
estimate,  in  some  sense. 

Generally  speaking,  parametric  methods  are  preferable  to  nonparametric  methods  if  there  is  good 
reason  to  believe  the  model  is  correct  and  a  suitable  method  is  available  with  which  to  estimate  the 
parameters.  Often,  however,  little  is  known  about  the  true  distribution  of  the  different  classes,  and  so  a 
nonparametric  technique  is  required.  Also,  for  the  purposes  of  classification  and  discrimination,  a  fair 
approximation  of  the  density  is  often  sufficient,  and  as  such,  the  extra  accuracy  that  might  be  afforded  by 
a  parametric  model  may  not  be  worth  the  risk  associated  with  an  incorrect  model  or  the  computational 
effort  required. 

The  algorithm  described  below  evolved  from  the  study  of  two  common  density  estimates.  The  first  of 
these,  the  kernel  estimator  (Parzen  [1962],  Silverman  [1986]),  is  a  truly  nonparametric  technique  that 
places  a  fixed  density,  the  kernel,  at  every  data  point.  This  essentially  requires  the  storage  of  all  the  data, 
and  an  evaluation  of  the  kernel  estimator  can  be  quite  computationally  intensive  (although  there  are 
techniques  to  speed  up  the  evaluation,  see  Silverman  [1986]). 
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In  Eq.  4,  the  kernel  K  is  a  probability  density  function  (though  this  requirement  can  be  relaxed)  and 
the  points  are  the  observations.  Note  that  if  A"  is  a  density,  then  so  is  D.  Another  useful  feature  of  the 
kernel  estimator  is  that  the  smoothness  properties  (differentiability)  of  the  kernel  is  inherited  by  the 
estimate. 

A  related  technique  is  called  the  reduced  kernel  estimator  (RKE)  (Fukunaga  and  Hayes  [1989]). 
Here  the  idea  is  to  reduce  the  original  data  set  to  a  smaller  representative  set,  and  then  compute  the 
kernel  estimator  for  this  smaller  set.  This  reduces  the  computational  complexity  of  the  estimator  once  the 
reduced  data  set  has  been  chosen,  but  the  reduction  of  the  data  set  can  be  quite  computationally 
intensive.  This  idea  of  reducing  the  number  of  kernels  in  the  kernel  estimator  is  the  basis  for  the  algorithm 
described  below,  though  it  will  be  developed  from  a  different  perspective  than  the  kernel  estimator. 

In  addition  to  the  storage  and  computational  requirements  of  the  kernel  estimator,  there  is  the 
necessity  of  the  choice  of  h.  This  is  critical  to  the  performance  of  the  estimator,  and  a  large  body  of 
literature  addresses  this  subject.  Although  the  storage  and  computational  requirements  are  less  for  the 
RKE,  the  choice  of  h  is  still  cridcal. 

Another  technique  for  density  estimation,  at  least  superficially  related  to  the  kernel  estimator,  is 
known  as  the  method  of  mixtures: 

4 

D(x)  =  £*,*(*.$,)  (5) 

/  l 

This  is  really  a  parametric  technique,  which  fits  the  data  to  a  mixture  of  a  fixed  number  of  densities.  The 
resultant  estimate  requires  a  knowledge  of  q,  the  number  of  components,  or  subpopulations,  of  the  data. 
There  is  also  a  nonparametric  technique:  fitting  the  data  to  a  mixture.  This  is  nonparametric  in  the  same 
sense  that  the  histogram  is.  Rather  than  fitting  a  “mixture”  of  fixed  non-overlapping  rectangles  to  the 
data,  as  the  histogram  does,  this  fits  a  mixture  of  a  given  density  to  the  data.  Discussions  can  be  found  in 
Titterington  et  al.  [1985],  Everitt  and  Hand  [1981],  and  McLachlan  and  Basford[1988]. 

The  adaptive  mixture  approach  is  a  combination  of  these  two  ideas.  It  approximates  the  density  as  a 
mixture,  as  in  the  nonparametric  version  of  the  mixture  method.  The  number  of  components,  q,  is  chosen 
from  the  data.  Like  the  RKE,  the  adaptive  mixture  chooses  a  subset  of  the  data  at  which  to  place  a 
kernel.  Unlike  the  RKE,  this  is  done  recursively.  Also,  unlike  the  RKE,  the  points  which  do  not  get 
kernels  are  still  used  to  update  the  estimate. 

For  many  problems,  it  is  impractical  to  process  the  entire  data  set  at  once.  The  data  set  may  be  very 
large,  and  a  technique  which  iterates  through  the  data  repeatedly  would  be  impractical.  The  data  may  be 
nonstationary,  with  no  a  priori  model  of  character  of  nonstationarity  available.  The  problem  may  require 
real-time  processing,  forcing  the  system  to  respond  to  each  input  with  very  little  processing  time  allowed. 
In  situations  such  as  these,  a  recursive  technique  is  preferable  to  an  iterative  one.  In  an  iterative 
technique,  the  entire  data  set  is  processed  for  each  update  of  the  system.  Typically,  many  updates  are 
required  for  the  system  to  converge  to  an  acceptable  solution.  In  a  recursive  system,  each  update  requires 
only  the  current  state  of  the  system  and  the  most  recent  data  point.  This  greatly  reduces  the 
computational  complexity,  and  can  allow  the  tracking  of  nonstationary  distributions,  as  will  be  seen  below. 
As  intuition  indicates,  on  a  fixed  data  set,  an  iterative  technique  which  sees  the  data  many  times  will 
generally  outperform  a  recursive  technique  which  sees  the  data  once.  Often,  an  iterative  technique  can  be 
modified  to  run  recursively,  and  vice  versa,  so  this  is  more  a  concern  of  implementation  than  of  algorithm 
choice . 


The  above  considerations,  in  the  presence  of  high  data  rates,  indicate  the  suitability  of  a  recursive 
stochastic  approximation  procedure  (as  opposed  to  an  iterative  procedure)  for  estimating  the  various 
densities  A  parametric  approach,  such  as  finite  mixture  models,  is  well  suited  to  such  a  task 
(Titterington  [1984]).  These  procedures  will  take  the  form 

9m  t  =  0,1,...  (6) 

where  6  represents  the  parameters  to  be  estimated  and  D(x\6)  is  the  corresponding  density.  6t  denotes 
the  estimated  parameters  after  t  observations  and  x(+!  denotes  the  (t  +  l)i(  observation.  If  we  are  required 
to  give  an  estimate  for  D(x\0)  at  time  t  =  r,  our  response  would  therefore  be  £>(jc;0t).  d>(  is  the 
(time-varying)  parameter  update  function,  about  which  much  will  be  said  later. 

The  parametric  assumption  in  Eq.  6  is  valid  only  for  situations  in  which  we  know  that  the  densities 
being  modelled  all  belong  to  a  common  family,  D,  or  we  are  willing  to  make  this  assumption. 


DYNAMIC  ENVIRONMENT 


Let  us  now  consider  the  extended  problem  in  which  the  total  number  of  classes,  and  thus  the 
distributions  D(,)  from  which  our  xt  can  be  drawn,  is  finite  but  not  constant  over  time.  For  simplicity  of 
exposition,  we  will  assume  that  the  densities  £>(1)  are  stationary.  In  order  to  retain  our  formalism,  we  will 
consider  Q  to  be  the  set  of  all  classes  which  appear  during  the  operation  of  the  classifier,  with  |Q|  =  N. 
Let  Nt  be  the  number  of  classes  present  at  time  t,  that  is,  the  number  of  classes  C(,)  for  which  the  class 
probability  is  nonzero.  Then  a  new  class  entering  the  environment  at  time  t  =  r  corresponds  to 
N*  =  Nj.j  +  L  Let  class  C(Nl)  enter  into  Q  at  time  t  =  r,  and  remain  a  member  of  Q  until  time  t-x'. 
Then  Xt  —  2  jrj°  DU)  for  t  G  [r,  r') .  That  is  to  say,  the  observations  xt  can  be  drawn  from  distribution 


i  =  i 

DiN,)  for  r  S  ( <  i 


*r-l 


For  (  ^  [r, r'),Xt—  2  and  Xt  will  not  be  drawn  from  distribution 

i  =  i 

D(Nx).  Note  that,  since  we  are  assuming  ji\1)  >  0,  the  proportions  n\l)(i  =  1 . Nx)  must  be  adjusted 

during  the  period  of  time  that  C(A/r)  is  in  our  environment  (j  G  [r,  r')).  F°r  the  simplest  case,  the  class 
probabilities  n\l)  remain  constant  in  the  regions  U  £  [0,  r-  1],  and  t  G  [r,  r').  This  corresponds  to  a 
simple  kind  of  nonstationarity  in  the  overall  distribution  D  that  can  be  termed  a  “jump”  nonstationarity.  A 
good  deal  of  work  has  been  done  in  detecting  changes  such  as  these  in  stochastic  processes  in  the  simplest 
case  (Bansal  and  Papantoni-Kazakos  [1986,  1989]). 


Let  us  now  consider  additionally  that  the  individual  D(1)  be  nonstationary  (that  is,  time  dependent,  or 
drifting).  Thus,  £>(h is  a  function  of  time  as  well,  and  we  write  it  .  Now  we  allow  for  times 

t  *  x'.  The  density  is  allowed  to  change  with  time,  and  therefore  the  ability  to  track  such  a  change 
(or  drift)  is  necessary.  This  condition,  together  with  a  dynamic  Nt,  yields  what  will  be  termed  a  dynamic 
environment. 


FORMULATION  OF  THE  LEARNING  PROBLEM 


The  development  of  the  decision  function  d(«)  can  be  partitioned  into  two  distinct  phases;  analogous 
to  the  classification,  discrimination  distinction:  teaching  supervised  and  unsupervised  learning. 
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TEACHING  SUPERVISED 


The  first  phase  is  supei  vised  learning.  A  knowledge  of  the  characteristics  of  the  individual  densities 
D(,)  (or  D\l)  if  the  densities  are  nonstationary)  is  incorporated  into  S  during  supervised  learning.  This 
knowledge  can  be  either  a  priori  information  about  the  densities  or  derived  from  a  teaching  data  set 
{jtf}4?*!  for  which  the  true  class  is  known.  Depending  on  a  and  the  time  constraints  placed  on  the 
supervised  learning  process,  it  can  be  performed  either  iteratively  or  recursively.  There  has  been  much 
work  on  this  supervised  learning  problem  (Duda  and  Hart  [1973],  Sklansky  and  Wassel  [1981]),  though  a 
satisfactory  general  approach  for  the  recursive  nonparametric  scenario  has  yet  to  be  developed.  The 
adaptive  mixtures  described  in  the  next  section  are  developed  to  help  fill  this  void. 

Once  this  supervised  learning  has  been  accomplished,  the  decision  function  d{*)  can  easily  be 
obtained.  The  decision  surfaces  generated  by  the  densities  are  exactly  the  decision  surfaces  required  by 
d(»).  Note  that  a  loss  function  L  defined  on  Q  X  Q  can  easily  be  incorporated  into  the  construction  of  the 
loss  function  d,  as  sketched  in  Duda  and  Hart  [1973]. 

S  can  now  classify  new  observations  Xt  (t  >a),  for  which  the  true  class  is  unknown,  using  d{*). 
However,  if  our  teaching  set  {**}  (=i  is  not  representative  of  the  untagged  observations  {x<}"=a+i. 
classification  based  on  the  derived  decision  function  d  will  not  be  especially  good.  In  fact,  if  there  is 
information  in  the  untagged  observations,  then  a  system  which  does  not  use  this  information  to  improve  its 
decision  function  cannot  be  considered  optimal.  Furthermore,  if  the  distributions  of  the  known  classes 
drift,  or  if  new  classes  enter  the  environment,  unsupervised  learning  must  be  performed  if  the  system  is  to 
perform  well  in  this  changing  environment. 


UNSUPERVISED  LEARNING 

In  the  dynamic  environment  scenario  described  in  the  previous  section,  there  are  three  cases  of 
unsupervised  learning  that  may  be  considered.  These  all  approach  the  task  of  updating  the  decision 
function  d  by  using  information  derived  from  observations  for  which  the  true  class  is  unknown.  The  first 
case  involves  updating  the  distribution  estimate  D corresponding  to  a  known  class  C<0  for  which  the 
teaching  observations  were  not  representative.  This  nonrepresentative  assumption  implies  that  one  can 
perhaps  improve  d\‘\  and  hence  d,  by  using  information  contained  in  the  untagged  observations 
{Xt},"a+1.  Let  xV(dud2)  be  some  error  function  defined  on  densities  d\  and  d2.  Let  £>^°be  the  density 
associated  with  some  C(,)  E  Q.  Let  D&  be  the  estimate  of  obtained  via  {*,}  %i  (supervised 
learning).  If  for  some  time  (}  >  a  we  have 

xP(D(ti\Dp<V(D(li),Dp  (7) 


then  we  say  unsupervised  learning  has  improved  our  estimate  of  d\1)  relative  to  W.  If,  on  the  other  hand 

W(Df),  Dp  <  W(Df<0,  Dp  (8) 

we  say  that  the  estimate  of  D\l)  has  been  degraded  (relative  to  V  )  by  the  u.l.  process. 

Also  of  particular  interest  in  unsupervised  learning  is  the  modelling  of  the  classes  C^(J>Na)  that 
entered  Q  after  teaching  was  finished.  It  is  clear  that  based  on  the  teaching  set  {*<}  m,  there  exists  no 
information  in  d(»)  for  these  classes.  All  such  classes  will  therefore  be  termed  unknown  classes  and 
denoted  by  U(J)(j>Na),  and  we  desire  that  our  u.l.  algorithm  develop  model  densities  for  these  U®. 
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This  requires  the  system  to  first  recognize  that  an  observation  does  not  belong  to  any  of  the  current 
classes,  and  then  construct  a  model  for  the  new  class,  which  it  can  then  update  as  new  points  are  received. 

Modelling  the  includes  both  developing  an  overall  density  Du  for  all  observations  classified  as 
unknown  and  partitioning  Du  into  individual  densities  D®  corresponding  to  estimates  of  unknown  classes 
0^(J>Na)  for  which  teaching  observations  are  not  available.  (Notational  convention:  U^>  indicates  a 
member  of  Q  for  which  no  teaching  observations  are  available.  0®  indicates  an  unknown  class  developed 
by  an  unsupervised  learning  system  in  the  course  of  its  modelling  of  bu .) 

There  are  two  levels  of  recognition  that  S  must  now  perform:  recognition  that  an  untagged  observation 
xt  is  not  from  one  of  the  known  classes  C(,)  (and  therefore  tagging  xt  as  “unknown”),  and  distinguishing 
between  £/</’>  and  U^(j  *  k)  as  the  correct  unknown  class  of  origin  for  the  observation,  if  possible.  These 
two  recognition  levels  correspond  to  using  the  Du  against  individual  in  construction  d.  The  latter 
task  is  evidently  more  difficult  than  the  former,  due  to  the  added  complexity  of  partitioning  Dv . 

To  recap,  we  have  three  types  of  unsupervised  learning  we  would  like  to  incorporate  into  our  pattern 
recognition  system  S  after  the  initial  supervised  phase:  updating  our  density  estimates  associated  with 
known  classes  C(l),  developing  an  estimate  for  the  overall  density  of  unknown  observations,  and 
partitioning  i)(U) 

into  D &  corresponding  to  individual  unknown  classes  The  adaptive  mixtures 

developed  next  are  applicable  to  each  of  these  tasks. 


DEVELOPMENT  OF  ADAPTIVE  MIXTURE  APPROACH 


A  formulation  of  the  learning  problem  in  four  categories  has  been  developed.  We  will  now  introduce 
an  approach  capable  of  performing  recursive  nonparametric  learning  in  each  of  these  categories:  adaptive 
mixtures. 

We  will  develop  the  adaptive  mixture  from  density  estimation  techniques  of  kernel  estimation,  finite 
mixture  modelling,  and  stochastic  approximation  (s.a.).  The  extension  of  the  adaptive  mixture  beyond 
these  techniques  will  allow  for  the  modelling  of  dynamic  environments.  For  simplicity  of  exposition,  we 
will  focus  on  the  estimation  of  a  single  density.  This  should  be  thought  of  as  one  of  the  class  densities 

d\'\ 

FINITE  MIXTURES 

Consider  for  the  moment  the  problem  of  estimating  the  components  of  a  Gaussian  mixture.  That  is, 
we  assume  that  our  density  is  of  the  form 

Dt(x)  =  JjX?)'i>'(x-,4\o<ji))  (9) 

i  =  l 


where  n  is  known,  and  the  A,0  sum  to  1.  We  are  implicitly  assuming  here  that  the  data  come  from  a  single 
class,  and  we  are  trying  to  estimate  the  density  for  that  class.  We  wish  to  estimate  the  parameter  vector 
0T  which  consists  of  X,fi,  ando.  Let  us  also  assume  for  the  moment  that  D  is  stationary.  A  standard 
technique  for  estimating  the  parameter  vector  0T  is  to  maximize  the  loglikelihood.  We  will  write  an 
estimate  for  D(x)  with  parameter  vector  0  as  D(x;  6).  Following  Titterington  [1984],  we  set 

S(x,0)  =  A  log  [D{x,0)\  ■  (10) 
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and  use  these  likelihood  equations  to  obtain  the  update  formula 


&t*  1  =0  +  Qt !  5(x<+i; dt)  (11) 

This  can  be  seen  to  be  a  gradient  ascent  on  the  loglikelihood  surface,  and  under  certain  conditions  on 
<*t  at  and  D,  we  will  have  convergence  to  the  target  density  (see  Titterington  [1984]).  An  example  of  this 
kind  of  approximation  formula,  which  will  be  used  below,  is  the  following  set  of  recursive  update 
equations  (written  in  component  form): 

0<»>_  id)  £>f<0(*f) 

Pl  -1‘  <12) 

4?, -if® +■>!?,  (Pi?.  -4")  <>3) 

a.!?.  do 

*  «a  Pl?, [<»!?, -/-!,J1)fri;i)  -(.!'■*>) os) 

We  use  the  following  convention:  the  first  element  of  a  superscripted  list  corresponds  to  the  number  of 
the  mixture  component,  and  following  elements  correspond  to  the  indices  of  the  parameter  in  question. 
Thus,  refers  to  the  /  element  of  the  ‘th  mean  vector,  and  is  the  (/,  k)th  entry  in  the 

i,h  covariance  matrix.  In  vector  notation,  the  mean  and  covariance  update  rules  become 

/»i?i  ~Mt°)  (16) 

-A**0)7*-^0]  (17) 


We  will  call  Eq.  12-15  and  Eq.  17  the  “update  rule”  t//(*f+i;0/)- 


An  obvious  choice  for  a,\  (in  the  stationary  case) 


is  \j  =  i  J  .  Ifn, 


the  number  of  components 


in  the  mixture,  is  1,  this  is  just  1/f  +  1,  which  is  the  inverse  of  the  number  of  data  points.  In  general,  this 
can  be  thought  of  as  the  number  of  points  used  to  update  component  i.  If  the  density  D  is  not  known  to 
be  a  mixture  of  Gaussians,  however,  one  might  still  wish  to  use  the  above  formulation  to  find  an 
approximation  to  the  density  by  such  a  mixture.  In  some  sense,  the  kernel  estimator  is  an  extreme  of  this 
point  of  view.  Thus,  one  could  choose  m  “large  enough,”  start  the  estimate  with  some  initial  6,  and  then 
recursively  update  the  estimate  using  the  above  formula.  Assuming  that  the  density  is  well  approximated 
by  such  a  mixture  (which  is  the  case  if  m  is  large)  and  that  a  reasonable  initial  estimate  is  used,  this 
procedure  will  result  in  a  good  estimate  of  the  density. 

If  an  approximation  of  the  density  D  by  a  mixture  as  above  is  used,  the  number  of  components,  m, 
and  an  initial  estimate  must  be  chosen.  It  would  be  helpful  (and  in  fact  is  essential  in  the  nonstationary 
case)  if  the  algorithm  could  choose  m  and  the  initial  estimate  recursively  from  the  data.  It  is  this  which 
motivates  the  algorithm  described  below. 
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In  order  to  develop  an  estimate  of  the  form  given  by  Eq.  9,  we  will  use  a  combination  of  the  above 
finite  mixture  modelling  algorithm  (the  update  rule)  and  a  dynamic  allocation  procedure  which  allows  the 
algorithm  to  increase  the  number  of  terms  in  our  model  if  our  current  estimate  fails  to  account  for  the 
current  observation.  That  is,  we  will  add  a  new  term  to  the  mixture,  with  mean  /x  =  xt,  if  circumstances 
indicate  that  this  is  necessary.  Otherwise,  we  will  update  our  estimate  Qt  (and  hence  D,  ).  We  will  call 
this  “create  rule”  Ct(xt+  1 ,6t,k)  and  will  describe  it  shortly.  Our  s.a.  procedure  now  becomes 

0/+i  =  Ot  +  [I  -  Ut(,Xt*u&t)  +  Pt(xM-,6t)  Ct(xit\,6t<  k)  (18) 


/*(•)  in  Eq.  18  is  the  “decision-to-add-component”  function  and  takes  on  values  1  or  0,  depending 
on  whether  the  decision  is  to  add  a  component  or  not. 

Assuming  that  the  system  has  decided  to  add  a  component,  the  create  rule  C(«)  for  the  single-class 
case  is 


(m+i)  _ 

Pit  i  -  */♦  l 

(19) 

at*  l  - 

(20) 

=  a;;\ < i - w,)  o-i . m) 

(21) 

*!”’ = »> 

(22) 

m  =  m  +  1 

(23) 

An  obvious  choice  for  wt  is  at.  Thus,  the  new  component  is  centered  at  the  observation,  given  an  initial 
covariance  (which  may  be  user-defined  or  derived  from  the  components  in  the  neighborhood  of  the 
observation)  and  a  small  proportion.  All  the  other  proportions  must  be  updated  so  that  they  sum  to  1,  but 
otherwise  the  other  components  are  unaffected.  For  the  multiclass  modelling  case,  C(»)  becomes  a  bit 
more  involved.  This  situation  will  be  discussed  below. 

When  the  decision  is  made  to  add  a  component  for  each  data  point,  the  estimate  is  similar  to  the 
kernel  estimator  (one-dimensional  densities  are  used  for  clarity): 


D,(x) 


Putting  this  into  a  more  familiar  notation,  we  have 


Dn(x) 


(24) 


(25) 


Dn(x )  is  the  estimator  considered  in  Wolverton  and  Wagner  [1969J  and  Wegman  and  Davies  [1979]. 
Its  consistency  is  easily  established.  Thus,  in  this  extreme  case,  the  algorithm  is  consistent  for  reasonable 
choices  of  the  system  variables  (in  this  case,  w,  a,  and  K).  It  is  reasonable,  therefore,  that  since  the 
update  rule  is  a  recursive  maximum  likelihood  estimator  (see  Titterington  [1984],  Dempster  el  al-  [1977], 
Redner  and  Walker  [1984]),  and  so  in  some  sense  improves  the  estimate  between  the  addition  of  new 
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components,  that  if  the  decision  to  add  a  component  is  properly  chosen,  the  overall  system  will  be 
consistent.  The  performance  of  the  estimator  obtained  by  using  recursive  updates,  as  opposed  to  merely 
always  adding  another  term,  is  important.  The  reduction  in  the  number  of  terms  required  in  the  estimate 
is  a  storage  and  computational  advantage. 

The  decision  to  add  a  component  /*(•)  an  be  made  in  a  number  of  ways.  One  way  is  to  check  the 
Mahalanobis  distance  from  the  observation  to  each  of  the  components,  and  if  the  minimum  of  these 
exceeds  a  threshold  (called  the  create  threshold),  then  the  point  is  in  some  sense  “too  far  away”  from  the 
other  components,  and  a  new  component  should  be  created.  Recall  that  the  Mahalanobis  distance 
between  a  point  x  and  a  component  with  mean  /i(,)  and  covariance  o(<)  is  defined  by 

MM(x)  =  (x (26) 


Thus,  if  the  create  threshold  is  C,  then  we  create  a  new  component  at  the  point  xt+\  if 

M(.xui)  =  min  [Af0)(xf+i)]  >  C  (27) 

i 


Another  approach  is  to  let  /’(•)  be  determined  probabilistically.  That  is,  let  g{*\(jc)  be  the  probability 
that  a  point  drawn  from  the  distribution  defined  by  the  ith  component  is  further  from  the  mean  than  Jt: 

qui(x)  =  Prob(|X(,)  -/x(,)|  >  \x  ~/iU) |)  (28) 


Let  qh be  the  maximum  of  these,  and  let  Pui (•)  be  a  random  variable  which  takes  on  the  value  0 
with  probability  qt+i(xt+i)  and  1  with  probability  1  -  qni(xni)-  Thus,  if  the  point  is  close  to  the  mean  of 
an  existing  component,  the  system  is  unlikely  to  create  a  new  component,  and  if  it  is  far  from  any  existing 
component,  it  will  most  likely  create  a  new  component.  This  is  the  same  type  of  rule  as  the  one  described 
above,  but  it  eliminates  the  need  for  the  user-defined  create  threshold. 

Other  approaches  could  use  the  estimated  density  directly,  or  the  density  of  the  observations  in  the 
region  of  the  new  observation  (computed  recursively,  e.g.,  by  some  function  of  the  proportions  of  the 
neighboring  components  and  their  relative  distances  from  the  new  observation). 

Windowing 

The  most  common  technique  for  modifying  a  recursive  system  to  allow  the  estimation  of  a 
nonstationary  distribution  is  to  use  a  window  on  the  observations.  This  amounts,  in  the  simple  case,  to 
setting  at  to  some  small  constant.  This  puts  an  exponential  window  on  the  data,  forcing  the  system  to 
always  treat  the  newest  observation  with  a  certain  amount  of  respect.  This  approach  obviously  precludes  a 
system  from  being  consistent,  but  consistency  in  modelling  nonstationary  distributions  is  meaningless. 
Consider  the  general  update  formula  mentioned  above,  namely 

6t*i  -0  +  a  t  S(xt+u  &t)  (29) 
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The  conditions  placed  upon  a  t  'n  order  for  Eq.  29  to  be  a  consistent  estimation  routine  are  basically  (1) 
2Q!  t  =  oo  and  (2)  2a  f  2  <  » .  For  example,  a  t  =  t'1.  To  implement  a  windowed  estimator  and  address 
nonstationarities,  consider  the  perturbation  of  Eq.  29  to  be 

0ni  =  0i  +  Pt  S{xt*\,0t)  (30) 


where  the  f}t  are  such  that  B  >0  is  a  lower  bound  for (},.  Then  Eq.  30  is  a  windowed  s.a.  scheme  suitable 
for  modelling  nonstationary  densities.  Note  that  2/?*  =  <»,  so  consistency  is  unobtainable.  However,  this 
provides  a  window  on  the  data,  allowing  the  estimator  to  adapt  to  changes  in  the  underlying  density.  As  an 
example,  let  ft,  =  max  {r1,  fT1}  for  constant  B  >  0. 

It  is  important  to  note  that  asymptotic  considerations,  as  detailed  above,  become  moot  when  dealing 
with  nonstationarities.  What  is  of  import  is  the  level  of  performance  that  can  be  expected  under  given 
conditions.  For  instance,  the  variance  and  bias  that  can  be  expected  for  given  window  widths  under 
stationary  assumptions.  This  information  allows  one  to  evaluate  the  output  of  the  system. 

Extension  To  The  Multiclass  Case 

The  preceding  discussion  involved  modelling  a  single-class  distribution.  For  the  general  pattern 
recognition  problem  this  is  clearly  insufficient.  We  must  have  a  method  for  modelling  N  classes  within  the 
framework  of  Eq.  9.  Consider  N  separate  distributions  of  the  form  of  Eq.  9  That  is 

N 

Dt «  =  (31) 

=  i 

Each  of  the  component  densities  will  be  modelled  as  a  mixture  as  above.  Assume  that  a  supervised 
training  set  is  available,  so  that  an  initial  estimate  can  be  made  for  each  of  the  classes.  Assume  further  that 
all  the  classes  are  represented  in  the  initial  training  set.  We  can  then  model  the  individual  class  densities  as 
mixtures  as  above.  However,  in  the  general  case,  where  there  may  be  classes  which  are  not  represented  by 
the  initial  training  set,  we  need  a  mechanism  for  determining  whether  a  point  belongs  to  an  existing  class, 
or  whether  a  new  “unknown”  class  should  be  created. 

Let  A  be  the  scaled  normal:  A=  {2n)mn det[2(0]<fc').  We  now  require  that  the  create  function 
utilize  an  inclusion  test  I(A('\  Xf+i),  which  will  be  used  to  allow  Eq.  31  to  develop  new,  unknown  classes 
OY*  recursively.  (Note  that,  for  our  purposes,  a  decision  to  update  [/’(•)  =  0]  implies  no  need  for  any 
consideration  of  inclusion:  proportional  update  takes  care  of  itself.)  /(•)  can  be  thought  of  as  a 
“coveredness-coefficient”  and  is  used  to  determine  if  the  present  observation  is  predicted  by  one  of  the 
terms  in  the  summation  of  Eq.  31.  (This  is  analogous  to  a  tail-test,  with  the  proviso  that  we  are  testing 
individual  terms  in  the  mixture  of  Eq.  31  rather  than  the  classes.)  If  the  model  of  Eq.  31  fails  this  test  for 
all  classes,  and  the  creation  of  a  new  term  is  indicated  by  Eq.  27,  then  the  newly  created  term  will  be 
considered  the  first  member  of  a  new  unknown  class  In  this  case,  C(»)  is  as  above,  for  the 

single-class  case.  If,  on  the  other  hand,  the  model  passes  the  inclusion  test  for  the  current  observation  for 
one  or  more  classes,  then  the  new  term  will  be  incorporated  into  the  class  or  classes  for  which  the 
observation  passes  the  test.  This  case  will  be  discussed  further. 

Note  that  for  Dn  defined  as  in  Eq.  25  lim  PDn(e)  =  Pdopt(e)-  Thus,  from  an  asymptotic  point  of 
view,  there  is  justification  in  using  this  density-estimation  technique  for  constructing  our  decision  function 
in  the  fully  supervised,  multiclass  case.  In  addition,  we  have  implications  for  phase  one  of  unsupervised 
learning:  the  development  of  the  overall  density  estimate  D. 
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Specifically,  we  let  /(A('\;cf+i)  be  a  random  variable  such  that  I(A(,),xt+i)  =  1  if  A('\  (x<+i)  >  Tj 
and  I(A(i>,  (x(+1)  =  0  if  A(,)(x/+i)  <  Tj  for  some  threshold  7/  £  1.  If  Zl(A(()l *,tl)  =  0,  an  “unknown 
class”  will  be  created,  as  described  above.  If,  on  the  other  hand,  2  I(A'(,),  jc<+1)  is  nonzero,  C(»)  then  is 
as  in  the  single-class  case  (Eq.  19  and  23),  with  the  following  exception:  Cui  =  2  I(A(0,  terms  are 
created,  one  corresponding  to  each  class  C(,)  3  I(A('\jc/+i)  =  1,  each  with  A  =  VV,/Cf+i  (compare  Eq. 
22). 

I(A('\  Xt)  attempts  to  recursively  identified  the  modes  or  terms  in  the  unsupervised  data,  and  as  such 
cannot  be  perfect.  In  implementation,  it  is  possible  to  use  a  number  of  parameters  to  develop  /(•).  In 
particular,  a  “minimum  variance”  parameter,  a*,  can  be  used  to  aid  in  making  the  inclusion  decisions. 
For  nonasymptotic  reasons,  a  minimum  distance  (Mahalanobis)  can  be  useful.  (Note  that  these 
parameters  need  not  be  constant  over  the  entire  feature  space!)  Finally,  uncertainty  considerations  can  be 
made.  For  instance,  when  many  observations  (supervised  or  unsupervised)  have  been  made  in  a  given 
sector  of  feature  space,  we  have  more  confidence  in  our  estimate.  We  can  therefore  reduce  our 
dependence  on  u.l.  in  these  cases.  This  is  for  the  stationary  case.  In  the  nonstationary  case,  we  may 
indeed  wish  to  suspend  u.l.  in  certain  instances  until  a  “change  detector”  (such  as  in  Bansal  and 
Papantoni-Kazakos  [1986,  1989])  indicates  that  our  estimate  is  no  longer  valid.  While  these 
considerations  are  indeed  important,  they  deal  mainly  with  application-specific  issues.  The  point  of 
mentioning  them  is  that  they  are  imperative  precisely  because  there  can  be  no  “perfect”  u.l.  machinel 

As  an  aside,  a  classification  threshold  T  may  be  beneficial  for  the  ultimate  response  of  the  system, 
although  this  is  technically  unnecessary.  The  problem  arises  when  no  known  class  is  “close”  to  the  current 
observation.  In  this  case  (which,  by  the  way,  may  or  may  not  imply  the  creation  of  a  new,  unknown  class), 
the  probability  that  the  observation  originates  from  the  closest  class  (in  a  Mahalanobis  sense)  can  be  high 
while  the  likelihood  is  quite  low.  This  scenario,  not  uncommon  in  practice,  can  yield  misleading  system 
responses.  However,  a  classification  threshold  0  £  T  can  be  implemented  such  that 
max  D(0(x<)  <  T  implies  that  the  system  response  will  include,  along  with  the  probabilistic  ranking  of  the 
classes  of  origin,  the  proviso  that  the  observation  is  either  an  outlyer  or  originates  from  an  unknown  and  as 
yet  unmodeled  class. 


DISCUSSION 


We  have  presented  a  statistical  pattern  recognition  approach  to  both  supervised  and  unsupervised 
learning  which  allows  a  system  to  update  and  improve  its  model  of  the  dynamic  environment  based  on 
observations  for  which  truth  is  unknown.  The  adaptive  mixtures  presented  here  are  interesting  in  terms  of 
basic  density  estimation  as  well  as  learning  applications.  While  unsupervised  learning  can  never  be 
completely  reliable,  the  approach  outlined  above  can  be  tailored  to  individual  applications  in  such  a  way 
as  to  allow  a  new  dimension  in  pattern  recognition  systems. 

For  the  supervised  case,  analysis  of  the  asymptotic  performance  of  Eq.  18  as  a  density-estimation 
technique  is  found  in  Priebe  and  Marchette  [1990a].  The  relationship  of  this  density  estimation  to 
discrimination  is  discussed  in  Priebe  and  Marchette  [1990b].  Analysis  of  the  extension  of  this  work  to  the 
more  general  task  that  we  call  classification  (unsupervised  learning)  is  ongoing.  While  general 
performance  statements  in  this  area  will  require  severe  restrictions  on  the  true  densities  £>('\  we  have 
found  that  in  specific  instances,  Eq.  18  is  capable  of  all  three  stages  of  unsupervised  learning  discussed  at 
•  the  end  of  the  section  titled  Formulation  of  the  Learning  Problem:  (1)  updating  density  estimates 

D ,  associated  with  known  classes  C(,),  (2)  developing  estimate  Du  for  the  overall  density  of  unknown 
observations,  and  (3)  partitioning  Dv  into  corresponding  to  individual  unknown  classes  U®. 
However,  in  the  first  and  third  of  these  especially,  there  are  no  guarantees,  based  solely  on  unsupervised 
learning,  that  the  estimated  model  corresponds  to  the  true  £><0. 
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